Integrating A Lexical Database And A Training Collection For Text Categorization

Authors

Jose Maria Gomez-Hidalgo

Manuel De Buenaga Rodriguez

Abstract

Automatic text categorization is a complex and useful task for many natural language processing applications. Recent approaches to text categorization focus more on algorithms than on the resources involved in this operation. In contrast to this trend, we present an approach based on the integration of widely available resources, such as lexical databases and training collections, to overcome current limitations of the task. Our approach makes use of WordNet synonymy information to increase evidence for poorly trained categories. When testing a direct categorization, a WordNet-based one, a training algorithm, and our integrated approach, the latter exhibits better performance than any of the others. Incidentally, the performance of the WordNet-based approach is comparable to that of the training approach.

1 Introduction

Text categorization (TC) is the classification of documents with respect to a set of one or more pre-existing categories. TC is a hard and very useful operation, frequently applied to the assignment of subject categories to documents, to route and filter texts, or as a part of natural language processing systems.

In this paper we present an automatic TC approach based on the use of several linguistic resources. Nowadays, many resources like training collections and lexical databases have been successfully employed for text classification tasks [Boguraev and Pustejovsky, 1996], but always in an isolated way. The current trend in the TC field is to pay more attention to algorithms than to resources. We believe that the key idea for the improvement of text categorization is increasing the amount of information a system makes use of, through the integration of several resources.

We have chosen the Information Retrieval vector space model for our approach. Term weight vectors are computed for documents and categories employing the lexical database WordNet and the training subset of the test collection Reuters-22173.
We calculate the weight vectors for:

- a direct approach,
- a WordNet-based approach,
- a training collection approach,
- and finally, a technique for integrating WordNet and a training collection.

Later, we compare document-category similarity by means of a cosine-based function. We have conducted a series of experiments on the test subset of Reuters-22173, which yield two conclusions. First, the integrated approach performs better than any of the other ones, confirming the hypothesis that the more informed a text classification system is, the better it performs. Secondly, the lexical database oriented technique can rival the training approach, avoiding the need for the costly construction of training collections for every domain and classification task.

1 This research is supported by the Spanish Committee of Science and Technology (CICYT TIC94-0187).

2 Task Description

Given a set of documents and a set of categories, the goal of a categorization system is to decide whether any given document belongs to any given category or not. The system makes use of the information contained in a document to compute a degree of pertinence of the document to each category. Categories are usually subject labels like art or military, but other categories like text genres are also interesting [Karlgren and Cutting, 1994]. Documents can be news stories, email messages, reports, and so forth.

The most widely used resource for TC is the training collection. A training collection is a set of manually classified documents that allows the system to infer clues on how to classify new, unseen documents. There are currently several TC test collections, from which a training subset and a test subset can be obtained. For instance, the huge TREC collection [Harman, 1996], OHSUMED [Hersh et al., 1994] and Reuters-22173 [Lewis, 1992] have been collected for this task. We have selected Reuters because it has been used in other work, facilitating the comparison of results.
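The cosine-based comparison of document and category term weight vectors mentioned in the introduction can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: raw term frequencies stand in for the actual term weighting, `TOY_SYNONYMS` is a hypothetical hand-written table standing in for WordNet synonymy lookups, and the function names are invented.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def term_vector(text):
    """Raw term-frequency vector; an assumed stand-in for the real weighting."""
    return dict(Counter(text.lower().split()))


# Hypothetical synonym table standing in for WordNet synonymy information.
TOY_SYNONYMS = {
    "military": ["army", "armed", "forces"],
    "art": ["painting", "artwork"],
}


def category_vector(name, synonyms=None):
    """Category vector from its name, optionally expanded with synonyms."""
    vec = {name: 1.0}
    for syn in (synonyms or {}).get(name, []):
        vec[syn] = 1.0
    return vec


def rank_categories(document, categories, synonyms=None):
    """Score every category against the document, best match first."""
    doc_vec = term_vector(document)
    scores = {c: cosine(doc_vec, category_vector(c, synonyms)) for c in categories}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


doc = "the army moved its forces to the border"
direct = rank_categories(doc, ["art", "military"])
expanded = rank_categories(doc, ["art", "military"], TOY_SYNONYMS)
```

In this toy case the category name alone never appears in the document, so the direct ranking gives every category a score of zero, while the synonym-expanded vector for "military" overlaps with the document and is ranked first. This only illustrates the general idea of adding lexical evidence to a category vector; it does not reproduce the weighting or integration scheme used in the paper.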
Lexical databases have rarely been employed in TC, but several approaches have demonstrated their usefulness for term classification operations like word sense disambiguation [Resnik, 1995; Agirre and Rigau, 1996]. A lexical database is a reference system that accumulates information on the lexical items of one o

Similar resources

Integrating a Lexical Database and a Training Collection for Text Categorization

Using WordNet to Complement Training Information in Text Categorization

Automatic Text Categorization (TC) is a complex and useful task for many natural language applications, and is usually performed through the use of a set of manually classified documents, a training collection. We suggest the utilization of additional resources like lexical databases to increase the amount of information that TC systems make use of, and thus, to improve their performance. Our a...

L2 Learners’ Lexical Inferencing: Perceptual Learning Style Preferences, Strategy Use, Density of Text, and Parts of Speech as Possible Predictors

This study was intended first to categorize the L2 learners in terms of their learning style preferences and second to investigate if their learning preferences are related to lexical inferencing. Moreover, strategies used for lexical inferencing and text related issues of text density and parts of speech were studied to determine their moderating effects and the best predictors of lexical infe...

An Intelligent Personalized Service for Conference Participants

This paper presents the integration of linguistic knowledge in learning semantic user profiles able to represent user interests in a more effective way with respect to classical keyword-based profiles. Semantic profiles are obtained by integrating a naïve Bayes approach for text categorization with a word sense disambiguation (WSD) strategy based on the WordNet lexical database (Section 2). Sem...

The Role of Word Sense Disambiguation in Automated Text Categorization

Automated Text Categorization has reached the levels of accuracy of human experts. Provided that enough training data is available, it is possible to learn accurate automatic classifiers by using Information Retrieval and Machine Learning techniques. However, the performance of this approach is damaged by the problems derived from language variation (especially polysemy and synonymy). We investigate...

Publication date: 1997
